Optimized Dense Matrix Multiplication on a Many-Core Architecture

Authors

Elkin Garcia

Ioannis E. Venetis

Rishi Khan

Guang R. Gao

Abstract

Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures such as the IBM Cyclops-64 (C64) belong to a new class of many-core-on-a-chip systems with a software-managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this class of architectures. In this paper, we use dense matrix multiplication as a case study to present a general methodology for mapping applications to these kinds of architectures. Our methodology has the following characteristics: (1) Balanced distribution of work among threads to fully exploit available resources. (2) Optimal register tiling and tile traversal order, calculated analytically and parameterized by the register file size of the processor used. This results in minimal memory transfers and optimal register usage. (3) Architecture-specific optimizations to further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, power measurements show that the C64 is very power efficient, providing 530 MFLOPS/W for the problem under consideration.
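The register tiling idea in point (2) can be illustrated with a small sketch. This is not the authors' implementation and does not use their analytically derived tile sizes, which are not given in the abstract. The function name `matmul_register_tiled`, the tile parameters `tr` and `tc`, and the load counter are assumptions made for illustration only. The sketch keeps one tile of the output matrix in local accumulators (standing in for registers) and counts how many elements of the input matrices are read, to show why larger tiles reduce memory transfers.

```python
def matmul_register_tiled(A, B, n, tr=2, tc=2):
    """Multiply two n x n matrices (lists of lists) using tr x tc output tiles.

    Illustrative sketch only: tr and tc are hypothetical tile sizes,
    not the optimal values derived in the paper.
    Returns (C, loads), where loads counts element reads from A and B.
    """
    C = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tr):
        for j0 in range(0, n, tc):
            rows = range(i0, min(i0 + tr, n))
            cols = range(j0, min(j0 + tc, n))
            # Accumulators stand in for registers holding one tile of C.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for k in range(n):
                a = {i: A[i][k] for i in rows}  # one column slice of A
                b = {j: B[k][j] for j in cols}  # one row slice of B
                loads += len(a) + len(b)
                # Each loaded value is reused across the whole tile.
                for i in rows:
                    for j in cols:
                        acc[(i, j)] += a[i] * b[j]
            # Write the finished tile back to C once.
            for (i, j), value in acc.items():
                C[i][j] = value
    return C, loads
```

When `tr` and `tc` divide `n`, this sketch reads n^3 (1/tr + 1/tc) input elements. For a 4 x 4 problem, a 1 x 1 tile reads 128 elements while a 2 x 2 tile reads 64, for the same result. On real hardware the tile size is bounded by the register file, which is why the paper parameterizes the tiling by register file size.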


Related resources

High Performance Matrix Multiplication on Many Cores

Moore’s Law suggests that the number of processing cores on a single chip increases exponentially. Future performance increases will mainly be extracted from thread-level parallelism exploited by multi/many-core processors (MCP). Therefore, it is necessary to find out how to build the MCP hardware and how to program the parallelism on such MCP. In this work, we intend to identify the key ar...


Characterization of Intel Xeon Phi for Linear Algebra Workloads

This study focuses on the applicability of the Intel Xeon Phi coprocessor to some of the Basic Linear Algebra Subprograms (BLAS) subroutines. Based on the Many Integrated Core (MIC) architecture, the vector processing unit (VPU) in the Xeon Phi coprocessor provides data parallelism at a very fine grain, working on 512 bits (16 single-precision floats or 32-bit integers) at a time. In our work we analyze how ...


Designing Hardware/Software Systems for Embedded High-Performance Computing

In this work, we propose an architecture and methodology to design hardware/software systems for high-performance embedded computing on FPGA. The hardware side is based on a many-core architecture whose design is generated automatically given a set of architectural parameters. Both the architecture and the methodology were evaluated running dense matrix multiplication and sparse matrix-vector mu...


A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure

The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication that can run on a Fibonacci Hypercube structure. Most popular algorithms for parallel matrix multiplication cannot run on a Fibonacci Hypercube structure; therefore, a method that can run on all structures, especially the Fibonacci Hypercube structure, is necessary for parallel matr...


Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences

This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared-memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For s...




Publication date: 2010
